[perl] Efficient processing of large text

Posted by jesper on Stack Overflow
Published on 2010-04-06T20:54:07Z

Filed under: perl | large | text | efficiency

I have a text file that contains over one million URLs. I need to process this file to assign the URLs to groups based on host address:

{
    'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
    'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}

My current basic solution takes about 600 MB of RAM to do this (the file is about 300 MB). Could you suggest a more efficient approach? My current solution simply reads the file line by line, extracts the host address with a regex, and puts the URL into a hash.
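
For reference, here is a minimal sketch of the approach described above, with one possible memory tweak. It is not the code I am actually running. The sub name `group_urls`, the regex, and the sample data are assumptions made for illustration. The tweak stores only the part of each URL after the host, so the host prefix is held once as the hash key instead of once per URL. This may reduce memory somewhat, although Perl's per-scalar overhead still applies to every list element.

```perl
use strict;
use warnings;

# Group URLs by "scheme://host". Only the remainder of each URL is
# stored in the per-host list; the host itself lives in the hash key.
sub group_urls {
    my ($fh) = @_;
    my %groups;
    while (my $line = <$fh>) {
        chomp $line;
        # Capture "scheme://host" in $1 and the rest of the URL in $2.
        # Lines that do not look like a URL are skipped.
        next unless $line =~ m{^(\w+://[^/]+)(.*)$};
        push @{ $groups{$1} }, $2;
    }
    return \%groups;
}

# An in-memory filehandle stands in for the real 300 MB file here.
my $sample = join "\n",
    'http://www.ex1.com/a',
    'http://www.ex1.com/b',
    'http://www.ex2.com/c';
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $groups = group_urls($fh);
close $fh;

for my $host (sort keys %$groups) {
    print "$host => @{ $groups->{$host} }\n";
}
```

The full URL can be rebuilt when needed by concatenating the hash key with a stored suffix.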

© Stack Overflow or respective owner
